AITopics | data mining


data mining

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Interview with Thi Kieu Khanh Ho: Time-series anomaly detection

AIHub | Jul-9-2026, 09:01:51 GMT

The latest interview in our series with the AAAI/SIGAI Doctoral Consortium participants features Thi Kieu Khanh Ho, who is studying time-series anomaly detection. We found out more about her research, what inspired her to study AI, and what she plans to work on next.

Tell us a bit about your PhD -- where are you studying, and what is the topic of your research?

I am doing my PhD at McGill University and Mila - Québec AI Institute, in the Department of Electrical and Computer Engineering, supervised by Professor Narges Armanfard. My research focuses on time-series anomaly detection, the problem of teaching AI systems to recognize when something unusual or abnormal is happening in complex, real-world data streams, without relying on large amounts of labeled examples.

anomaly detection, artificial intelligence, data mining, (14 more...)

AIHub

Country: North America > Canada > Quebec > Montreal (0.25)

Genre: Personal (0.70)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.71)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence (1.00)


Cloudflare will filter out web crawlers that serve AI companies

Engadget | Jul-2-2026, 21:17:10 GMT

The hosting platform wants sites to have more control over how AI companies use their content. Cloudflare has announced plans to automatically block mixed-use web crawlers that index websites for search engines and act as AI agents and trainers at the same time. The company previously offered its customers the optional ability to prevent crawlers from scraping their sites for AI chatbots, but now Cloudflare's stance is becoming more defensive by default. "Now that the majority of traffic on the Internet is non-human, we must go further and act faster so that a sustainable ecosystem can emerge," Matthew Prince, Cloudflare's CEO and co-founder, shared in a statement. Cloudflare's new tools and partnerships give website owners increased visibility and commercial opportunities and benefit AI companies that have bots with clear and transparent intent.

artificial intelligence, data mining, natural language, (14 more...)

Engadget

Industry: Leisure & Entertainment > Games > Computer Games (0.71)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining > Web Mining (0.65)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.59)


Policy Optimization Achieves Data-Dependent Regret Bounds in MDPs with Unknown Transitions

Li, Mingyi, Tsuchiya, Taira, Yamanishi, Kenji

arXiv.org Machine Learning | Jul-1-2026

We study policy optimization for online episodic tabular Markov decision processes with unknown transition kernels, aiming for best-of-both-worlds guarantees together with data-dependent regret bounds. Recent work (Dann et al., 2023; Li et al., 2026) has shown that policy optimization can adapt to both adversarial and stochastic losses with first-order, second-order, and path-length bounds, but only under known transitions, leaving open whether such data-dependent guarantees are achievable by policy optimization when the transition kernel is unknown. We resolve this by developing a new algorithm based on optimistic follow-the-regularized-leader that attains these guarantees under unknown transitions. The key ingredient is a new design of optimistic $Q$-function estimators together with a data-dependent transition bonus that controls estimator bias through the loss-prediction error. Our analysis further identifies an unavoidable transition-dependent complexity term that captures the intrinsic cost of estimating the transition kernel. As a result, we obtain first-order, second-order, and path-length bounds with the transition-dependent complexity term while simultaneously achieving gap-dependent $\mathrm{polylog}(T)$ regret in the stochastic regime.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

2606.31769

Genre: Research Report (0.40)

Industry:

Leisure & Entertainment (0.67)
Media > Television (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.66)
Information Technology > Data Science > Data Mining > Big Data (0.46)


What Drives the Inlier-Memorization Effect? A Theory of Outlier Detection via Early Training Dynamics

Kim, Kunwoong, Kim, Dongha

arXiv.org Machine Learning | Jun-30-2026

Outlier detection (OD) aims to identify anomalous instances by learning the underlying structure of normal data (inliers), and is particularly challenging in fully unsupervised settings where no information about anomalies is available during training. Recent advances have leveraged the inlier-memorization (IM) effect, a phenomenon in which deep models memorize inlier patterns earlier than those of outliers, as a powerful signal for distinguishing outliers. However, despite its empirical success, the theoretical understanding of the IM effect remains limited. In this work, we present a theoretical study of the IM effect. Focusing on a simple autoencoder, we show that, under mild assumptions, the model can successfully memorize inliers while failing to memorize outliers during certain stages of early training. In particular, we characterize not only the emergence of the IM effect, but also its strength and persistence, and analyze how these properties depend on the data distribution and parameter initialization. In addition, building on these insights, we derive simple yet practical guidelines for enhancing the IM effect, including data preprocessing and parameter initialization schemes, achieving state-of-the-art performance on the ADBench datasets. Our findings provide a theoretical foundation for the IM effect and offer actionable directions for improving IM-based outlier detection methods.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

2606.29791

Country:

Europe (0.92)
North America > United States > California (0.27)

Genre: Research Report > New Finding (0.87)

Industry: Health & Medicine (0.69)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


TimeLAVA: Learning-Agnostic Valuation for Time Series Data

Liu, Wenqin, Quan, Weizhi, Zuo, Aoqi, Gao, Erdun, Nguyen, Vu, Sejdinovic, Dino, Bondell, Howard, Gong, Mingming

arXiv.org Machine Learning | Jun-30-2026

Data valuation quantifies the intrinsic quality of individual samples to enable principled data curation, quality control, and robust learning. For time series in critical domains such as healthcare, finance, and industrial monitoring, effective valuation methods are essential yet fundamentally lacking. Existing approaches are either model-dependent, limiting their generalizability, or designed for i.i.d. data and thus fail to capture temporal dependencies, multi-scale patterns, and non-stationary dynamics inherent to sequential data. We introduce TimeLAVA, a learning-agnostic framework that values temporal segments by their marginal contribution to minimizing distributional discrepancy between evaluated and reference data. At its core is a novel Selective Wavelet-based Wasserstein discrepancy combining multi-scale wavelet transforms for temporal localization with unbalanced optimal transport for robustness to distributional shifts. Segment values are efficiently computed via sensitivity analysis without requiring model training and aggregated into point-wise scores. We provide theoretical guarantees linking valuation to model-agnostic generalization and prove bounded sensitivity to outlier contamination. Extensive experiments across anomaly detection, data pruning, and label noise detection demonstrate that TimeLAVA produces significantly more informative value scores than existing methods on diverse real-world datasets.

data mining, learning-agnostic valuation, machine learning, (17 more...)

arXiv.org Machine Learning

2606.18729

Country:

North America (0.28)
Asia (0.28)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.70)
Information Technology > Data Science > Data Quality > Data Transformation (0.67)


The Fundamental Limits of Valid Transport Map Estimation

Balakrishnan, Sivaraman

arXiv.org Machine Learning | Jun-30-2026

Many modern generative modeling methods, including diffusion models, normalizing flows, and flow matching, estimate transport maps or plans between distributions without explicitly targeting an optimal transport (OT) map. In applications like generative modeling, the transport cost itself is irrelevant, and this makes it natural to target maps which are more tractable from either a statistical or computational standpoint. In this short note, we formalize the task of estimating any valid transport map in a rigorous minimax framework. One consequence of this framing is that it yields sample complexity lower bounds for any method whose learned object is evaluated as a transport map or plan, including flow matching and diffusion-based generative models, in settings where direct analysis would be challenging due to the analytic complexity of the methods and their target maps. We observe that, under standard, though strong, stability assumptions from the OT literature, estimating any valid transport map is statistically as hard as estimating the OT map. We complement these results with some examples showing that when these stability assumptions fail, alternative transport maps can be learned substantially more accurately than the OT map. Our minimax framing provides a rigorous foundation for understanding the statistical limits of modern transport-based generative methods and clarifies when targeting sub-optimal maps can provide real statistical advantages.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Machine Learning

2606.30574

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.40)


Adversarial Contamination Meets Hard Thresholding: An Iterative Algorithm with Signal Adaptivity and Minimax Optimality

Liu, Shixiang, Yang, Hanming

arXiv.org Machine Learning | Jun-29-2026

Pervasive data contamination -- stemming from measurement errors, outliers, or adversarial corruption -- has motivated the development of robust statistical methods. In this context, we propose a two-stage Adversarial Contamination-resistant Iterative Hard Thresholding (AC-IHT) algorithm for high-dimensional regression with contamination. Our nonconvex algorithm achieves minimax near-optimal (up to logarithmic terms) estimation by iteratively updating the coefficient vector and the contamination vector with different thresholding scales. We further demonstrate that our AC-IHT estimator is signal-adaptive: under proper signal conditions, it adaptively attains a sharper estimation rate and more accurate support recovery. Moreover, it enjoys the strong oracle property, laying a theoretical foundation for asymptotic inference. Numerical experiments confirm its superior finite-sample performance. Finally, we discuss theoretical extensions of the proposed procedure to generalized linear models and to heavy-tailed noise settings.
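The alternating update mentioned in the abstract, where a coefficient vector and a contamination vector are each hard-thresholded at their own scale, can be illustrated with a generic sketch. This is not the authors' two-stage AC-IHT algorithm. The mean-shift model y = X beta + gamma + noise, the keep-the-top-k thresholding rule, the fixed step size, and the names `hard_threshold`, `robust_iht`, and `make_demo_data` are all assumptions made for illustration.

```python
import numpy as np


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        keep = np.argpartition(np.abs(v), -k)[-k:]
        out[keep] = v[keep]
    return out


def robust_iht(X, y, s, k, n_iter=300):
    """Alternate hard-thresholded updates of beta (sparsity s) and gamma (sparsity k).

    Assumed model (illustrative only): y = X @ beta + gamma + noise, where
    gamma is a sparse vector of per-observation contamination.
    """
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(n)
    # Conservative step size: 1 / (largest eigenvalue of X^T X).
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        # Contamination update: the k largest residuals are treated as outliers.
        gamma = hard_threshold(y - X @ beta, k)
        # Coefficient update: gradient step on the cleaned residual, then threshold.
        descent_dir = X.T @ (y - gamma - X @ beta)
        beta = hard_threshold(beta + step * descent_dir, s)
    return beta, gamma


def make_demo_data(seed=0, n=200, p=50, n_outliers=10):
    """Synthetic data: 3 active coefficients and a few grossly shifted responses."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:3] = [3.0, -2.0, 2.5]
    gamma_true = np.zeros(n)
    gamma_true[:n_outliers] = 50.0  # large shifts on the first rows
    y = X @ beta_true + gamma_true + 0.1 * rng.standard_normal(n)
    return X, y, beta_true, gamma_true
```

Calling `robust_iht(X, y, s=3, k=10)` on the output of `make_demo_data()` returns the two estimated vectors. The sparsity levels `s` and `k` are supplied by hand here, whereas the abstract describes thresholding scales chosen by the procedure itself.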

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

2606.27685

Genre: Research Report > New Finding (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.62)


Learning Probabilistic Filters with Strictly Proper Scoring Rules

Bach, Eviatar, Baptista, Ricardo, Bröcker, Jochen, Chen, Bohan, Stuart, Andrew

arXiv.org Machine Learning | Jun-26-2026

Bayesian filtering of partially and noisily observed dynamical systems seeks to infer the evolving conditional distribution of the state of a dynamical system, given observations, in an online fashion. This Bayesian filtering distribution is the natural object for uncertainty quantification, but it is rarely available as a supervised learning target. However, one can often use the forecast model to generate synthetic system trajectories, along with synthetic observations. We introduce the proper scoring ensemble filter (PSEF), an ensemble data assimilation method based on training an analysis map to approximate the filtering distribution using only synthetic state--observation trajectories. The analysis step is represented as a permutation-invariant, transformer-based map that takes as input a forecast ensemble and observations, producing an analysis ensemble. Training is based on strictly proper scoring rules -- with the energy score used in our implementation -- so that probabilistic accuracy is rewarded over the whole probability distribution. We prove that, under a realizability assumption, the population objective is minimized by the true Bayesian filtering distribution. We also derive the finite-ensemble empirical objective used in training and relate its single state--observation trajectory form to the population objective, using a mean-field consistency argument. Numerical experiments show that the learned filter accurately approximates challenging filtering distributions, including nonlinear, non-Gaussian, and multi-modal posteriors, and achieves stronger performance in data assimilation tasks than classical methods or learning-based methods with mean-squared-error objectives. For close-to-Gaussian problems, learning a correction to the EnKF is the best approach, while for highly non-Gaussian problems an end-to-end approach that discards this inductive bias is superior.
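For readers unfamiliar with the energy score named in the abstract, a minimal sketch of its standard definition, ES(P, y) = E||X - y|| - (1/2) E||X - X'|| with X and X' independent draws from the forecast P, estimated from a finite ensemble, may help. The function name `energy_score` and the choice of estimator (averaging over all member pairs, including each member with itself) are assumptions for illustration. The paper's own finite-ensemble training objective may differ.

```python
import numpy as np


def energy_score(ensemble, observation):
    """Ensemble estimate of the energy score; lower values are better.

    ensemble:    array of shape (m, d), m ensemble members of dimension d
    observation: array of shape (d,), the verifying state
    """
    ensemble = np.asarray(ensemble, dtype=float)
    observation = np.asarray(observation, dtype=float)
    m = ensemble.shape[0]
    # Mean distance between ensemble members and the observation.
    term_obs = np.linalg.norm(ensemble - observation, axis=1).mean()
    # Half the mean pairwise distance between members (all m*m ordered pairs).
    pairwise = np.linalg.norm(ensemble[:, None, :] - ensemble[None, :, :], axis=2)
    term_spread = pairwise.sum() / (2.0 * m * m)
    return term_obs - term_spread
```

The first term rewards ensembles that sit close to the observation, while the second keeps the score from favoring an ensemble that collapses to a single point. That balance is what allows the whole distribution, not just its mean, to be scored.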

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2606.26497

Country:

North America > United States (1.00)
Europe (0.67)

Genre: Research Report > New Finding (0.65)

Industry:

Leisure & Entertainment > Games (0.63)
Government > Regional Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.92)
(2 more...)


British Police Built a Sprawling Crime-Prediction Machine. Some Results Couldn't Be Trusted

WIRED | Jun-25-2026, 10:00:00 GMT

As UK police embrace the AI revolution, a WIRED investigation reveals the messy inside story of one region's experiment with predictive analytics. The Think Family Database holds records on close to half a million people who live in the city of Bristol, England. For many years, few of them knew anything about it. Launched in 2016 by the Bristol City Council and the regional Avon and Somerset Police, the database has stored all manner of sensitive information--police intelligence reports, housing status, mental health records, teenage pregnancies, enrollment in parenting courses, free school meals. On top of this sensitive data, officials built machine-learning models to assign scores to thousands of adults and children. They hoped to build what they called a "picture of threat, harm, and risk" in the region. At an event in early 2022 to help officials tackle child exploitation crimes, one police data scientist described part of the approach this way: "I essentially dump all that data in a big bucket and stir it with a data-science spatula, and we come out with a lovely risk score for everybody." This risk scoring inside the Think Family Database was just one part of Avon and Somerset Police's sprawling predictive analytics program.

artificial intelligence, data mining, machine learning, (14 more...)

WIRED

Country: Europe > United Kingdom > England > Bristol (0.24)

Genre: Research Report (0.68)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Government > Regional Government > Europe Government > United Kingdom Government (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)


Latent Block-Diffusion Temporal Point Processes: A Semi-Autoregressive Framework for Asynchronous Event Sequence Generation

Zhang, Shuai, Chen, Yancheng, Zhou, Chuan, Liu, Yang, Lin, Xixun, Zhao, Xiangyu, Zhu, Jun, Ma, Zhi-Ming

arXiv.org Machine Learning | Jun-25-2026

Modeling and sampling from the underlying distribution of asynchronous event sequences are crucial in various real-world applications, including social networks, medical diagnosis, and financial transactions. Existing autoregressive methods suffer from error accumulation during multi-step generation, while non-autoregressive diffusion methods are typically limited to fixed-length output sequences. In this paper, we propose Latent Block-Diffusion Temporal Point Processes (LBDTPP), a novel semi-autoregressive TPP framework that introduces a latent block diffusion mechanism for high-quality and variable-length event sequence generation. The core idea is to define an autoregressive probability distribution over event blocks in latent space and perform Gaussian diffusion within each block. By sequentially generating blocks while simultaneously sampling events in each block, LBDTPP preserves the length flexibility of autoregressive TPPs and inherits the parallel high-quality generation capability of diffusion models. Theoretically, we derive Wasserstein error bounds showing that, under suitable local approximation and prefix-stability assumptions, block-wise generation can reduce error accumulation compared with event-wise autoregressive generation. Extensive experiments on six real-world benchmark datasets demonstrate that LBDTPP outperforms state-of-the-art TPP baselines in both unconditional and conditional generation tasks. Further empirical analyses verify the benefits of latent-space diffusion and block-wise generation, and reveal the trade-off between generation quality and block size. Our code is available at https://github.com/Zh-Shuai/LBDTPP.

data mining, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2606.24982

Country: Asia > China (0.28)

Genre: Research Report (0.82)

Industry:

Education (0.68)
Health & Medicine (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)
